IDFraIP: A Novel Protein Identification Algorithm Based on Fragment Intensity Patterns
Abstract
Identifying peptides from their fragmentation spectra by database searching is crucial for interpreting LC-MS/MS data. Widely used algorithms have not fully exploited the intensity patterns in fragment spectra. SQID incorporated intensity information and identified significantly more peptides than Sequest and X!Tandem. Although SQID was tested on datasets from different platforms to show its robustness and effectiveness, many other characteristics were not considered. This article builds on the intensity pattern modeling reported by SQID and proposes a novel scoring model for identifying fragment spectra. Compared with SQID and Sequest at a 1% false discovery rate (FDR), IDFraIP identified more confident peptides and spectra.

Introduction

Tandem mass spectrometry (MS/MS) plays a pioneering role in examining the activities and functional states of proteins [1]. Proteomics experiments generate large numbers of MS/MS fragment spectra, and interpreting these experimental spectra to extract high-confidence peptides is crucial to proteomics studies [2, 5, 10-11]. Hence, protein identification algorithms are necessary for identifying spectra on a large scale [5, 7]. Most identification algorithms reported in the literature use database searching, in which the most important step is to determine the similarity between experimental and theoretical spectra [1, 2, 3]. Current protein identification algorithms, including Sequest [12], X!Tandem [8], and Mascot [6], primarily use predicted fragment m/z values to assign peptide sequences to MS/MS spectra [5, 9, 11]. Intensity information has rarely been considered, although SQID [5] demonstrated that intensity pattern modeling can increase the number of credibly identified peptides and spectra. SQID also showed an effective approach to building an algorithm model [1, 4-5, 9].
The scoring function is the core of a peptide identification algorithm [9]. Following the intensity pattern model reported by SQID [5], we built a novel protein identification algorithm named IDFraIP. To validate the accuracy and robustness of IDFraIP, we compared it with SQID and Sequest at 1% FDR on datasets produced on different platforms, and it showed a higher identification rate and accuracy.

International Conference on Materials Engineering and Information Technology Applications (MEITA 2015). © 2015. The authors. Published by Atlantis Press.

Materials and Methods

MS/MS Datasets. Standard mixtures of 18 proteins were analyzed on two types of instruments, Thermo Finnigan LTQ-FT and Micromass/Waters QTOF Ultima, abbreviated FT and QTOF, respectively. These datasets can be downloaded from https://regis-web.systemsbiology.net//PublicDatasets/. The E. coli proteome spectra were downloaded from http://marcottelab.org/MSdata/Data_03/. The S. pneumoniae D39 data, which contain more than 270,000 spectra and were obtained from http://bioinformatics.jnu.edu.cn/software/proverb/, served as the training dataset.

Data Preprocessing. The raw files of S. pneumoniae D39 and E. coli were converted to dta format with Bioworks 3.31. For searches with Mascot, the dta files were merged into Mascot generic format (MGF) with the merge.pl program. The dta files served as input for both the method in this article and Sequest.

Peaks Selecting. Isotope peaks can increase the false positive rate (FPR), so they need to be removed. In this article, two peaks lying within 1 ± 0.25 Da of each other were considered isotope peaks, and the weaker of the two was removed. Different algorithms also select effective peaks in different ways: SQID and Sequest select the strongest 80 and 200 peaks from a fragment spectrum, respectively.
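The isotope-peak filter described above can be sketched in Python as follows. This is only an illustration, not the authors' implementation: the function name `remove_isotope_peaks`, the `(m/z, intensity)` tuple layout, and the reading of "1 ± 0.25 Da" as an m/z spacing between 0.75 and 1.25 Da are all assumptions.

```python
from typing import List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity); layout is an assumption


def remove_isotope_peaks(
    peaks: List[Peak], spacing: float = 1.0, tol: float = 0.25
) -> List[Peak]:
    """Drop the weaker peak of every pair whose m/z gap is within spacing ± tol.

    Assumption: "1 ± 0.25 Da" is read as a gap in [0.75, 1.25] Da.
    """
    ordered = sorted(peaks)  # sort by m/z
    removed = set()
    for i, (mz_i, int_i) in enumerate(ordered):
        for j in range(i + 1, len(ordered)):
            gap = ordered[j][0] - mz_i
            if gap > spacing + tol:
                break  # later peaks are even farther away
            if gap >= spacing - tol:
                # Mark the less intense peak of the pair (ties drop the later one).
                removed.add(i if int_i < ordered[j][1] else j)
    return [p for k, p in enumerate(ordered) if k not in removed]


spectrum = [(500.0, 100.0), (501.0, 40.0), (502.0, 10.0), (650.3, 80.0)]
filtered = remove_isotope_peaks(spectrum)
```

In this sketch a peak that has already been marked as weaker can still mark its own neighbor, so a whole isotope series collapses to its strongest member. Whether IDFraIP behaves this way is not stated in the text.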
OMSSA, by contrast, selects the 50 most intense peaks from a spectrum. Here, we divided each spectrum into bins 100 Da wide and selected the top six ion peaks in each bin.

False Discovery Rate (FDR). For the rank-1 PSMs of all spectra, the false discovery rate of the identified peptides was calculated by Käll's method, using the following formula:

$$\mathrm{FDR} = \frac{\text{no. of decoy PSMs above threshold}}{\text{no. of target PSMs above threshold}} \tag{1}$$

Scoring Model. Experimental spectra are assigned peptides by scoring them against a list of candidate peptides. In a protein identification scoring model, the essential question is how to evaluate how well an experimental spectrum matches a theoretical spectrum. To obtain a reasonable scoring model, we used various characteristics to evaluate the quality of a match, applied a Poisson distribution model, and considered three aspects, including consecutive ion-pair matches and b/y ion matches.
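Two of the steps described above, the bin-wise peak selection and the decoy-based FDR of formula (1), can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the IDFraIP code. The function names, the `(m/z, intensity)` and `(score, is_decoy)` tuple layouts, bins that start at 0 Da, "top" meaning most intense, "above threshold" meaning strictly greater, and returning 0.0 when no target PSM passes are all hypothetical choices.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity); layout is an assumption
PSM = Tuple[float, bool]    # (score, is_decoy) of a rank-1 PSM; assumption


def select_top_peaks(
    peaks: List[Peak], bin_width: float = 100.0, top_n: int = 6
) -> List[Peak]:
    """Keep the top_n most intense peaks in each bin_width-wide m/z bin."""
    bins: Dict[int, List[Peak]] = defaultdict(list)
    for mz, intensity in peaks:
        bins[int(mz // bin_width)].append((mz, intensity))  # bins start at 0 Da
    kept: List[Peak] = []
    for members in bins.values():
        kept.extend(sorted(members, key=lambda p: p[1], reverse=True)[:top_n])
    return sorted(kept)  # return in m/z order


def fdr_at_threshold(psms: List[PSM], threshold: float) -> float:
    """Formula (1): decoy PSMs above threshold / target PSMs above threshold."""
    decoys = sum(1 for score, is_decoy in psms if is_decoy and score > threshold)
    targets = sum(1 for score, is_decoy in psms if not is_decoy and score > threshold)
    return decoys / targets if targets else 0.0  # 0.0 for no targets is an assumption


# Ten peaks in the 100-200 Da bin plus one peak in the 200-300 Da bin.
demo_peaks = [(100.0 + i, float(i)) for i in range(10)] + [(250.0, 1.0)]
selected = select_top_peaks(demo_peaks)

demo_psms = [(10.0, False), (9.0, False), (8.0, True), (7.0, False), (3.0, True)]
fdr = fdr_at_threshold(demo_psms, threshold=5.0)
```

To report identifications at 1% FDR, one would scan candidate thresholds and keep the lowest score threshold for which `fdr_at_threshold` stays at or below 0.01.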